DOCUMENT RESUME



ED 107 629 95 SP 009 252 

AUTHOR Winne, Philip H. 

TITLE A Critical Review of Experimental Studies of Teacher

Questions and Student Achievement.
INSTITUTION Stanford Univ., Calif. Stanford Center for Research

and Development in Teaching.
SPONS AGENCY National Inst. of Education (DHEW), Washington,

D.C.

PUB DATE [75] 

CONTRACT NE-C-00-3-0061 

NOTE 21p.



EDRS PRICE MF-$0.76 HC-$1.58 PLUS POSTAGE

DESCRIPTORS *Academic Achievement; Affective Behavior; 

*Educational Experiments; Educational Research;

Literature Reviews; Questioning Techniques; Research

Design; Research Methodology; *Research Problems;

*Teacher Influence; *Teaching



ABSTRACT 

The purpose of this paper is to summarize and 
evaluate experiments which examined the effects of teacher questions 
on student achievement. The studies reviewed are of two types: (a) 
training experiments, in which the independent variable is teacher 
training; and (b) skills experiments, in which the frequency and 
manner of use of a teaching skill is prescribed by the experimenter. 
The first section of this paper presents brief overviews of both
training and skills experiments. Each overview lists (a) grade, (b) 
subject, (c) independent variable, (d) dependent measure, (e) 
teaching time, (f) analysis and results, (g) comments, and (h) 
conclusions. The second section discusses the experiments and 
presents suggestions for improving the quality of research on
teaching. These suggestions include the following areas: (a) 
reporting the study, (b) design, (c) analysis, (d) dependent 
measures, and (e) general questions of method. The last section 
presents conclusions gathered from the studies reviewed and warns of 
misleading research supported by superficial claims of valid
methodology. (PB)



ERIC 



A CRITICAL REVIEW OF EXPERIMENTAL STUDIES OF TEACHER

QUESTIONS AND STUDENT ACHIEVEMENT



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM THE PERSON
OR ORGANIZATION ORIGINATING IT.
POINTS OF VIEW OR OPINIONS STATED
DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Philip H. Winne
Stanford Center for Research and Development in Teaching
Stanford University



There can be little doubt that teacher questions are assumed to be
important factors influencing student achievement. Indeed, in a previous
review of the topic, Gall (1970) labeled this statement a "truism." In the
last decade, efforts have been made to test this belief with experiments.
The purpose of this paper is to summarize and evaluate the experiments
which examined the effects of teacher questions on student achievement. As
Heath and Nielson (1974) showed, research on teaching often exhibits flaws
in method. This paper identifies such errors in the domain of research on
teacher questions and provides some suggestions for improving research on
teaching in general.

The studies reviewed here are of two kinds. Training experiments are
studies in which the independent variable is merely teacher training.
Following training, teachers are free to use the skill(s) on which they were
trained at their discretion in teaching. Since the skill(s) are not
necessarily used with the same frequency or in the same manner by different
teachers within a treatment group, it is incorrect to label the skill itself
the independent variable used in these studies.

In contrast, skills experiments are studies in which the frequency and
manner of use of a teaching skill is prescribed by the experimenter. Thus,
the teaching skill is the nominal independent variable in these studies.
It may not be the actual independent variable, however, if teachers within
a treatment group who should be following the experimentally prescribed
use of the teaching skill vary in their actual delivery of the teaching act.

The next two sections present very brief overviews of training
experiments and skills experiments, respectively. The last two sections present
suggestions for improving the quality of research on teaching and a summary
of my conclusions gathered from the studies reviewed here about the effects
of teacher questions on student achievement.



The research reported herein was conducted at the Stanford Center
for Research and Development in Teaching, which is supported in part by
the National Institute of Education, Department of Health, Education, and
Welfare. The opinions expressed in this draft do not necessarily reflect
the position, policy, or endorsement of the National Institute of Education.
(Contract No. NE-C-00-3-0061.)



Yji^iicAla (1972) 

Grade: 9-12 (mixed grade classrooms)

Subject: American history, world history, U.S. government

Independent Variable: Intern training plus feedback on classroom performance
for convergent and divergent questions vs. no training or feedback.

Dependent Measure: Iowa Test of Educational Development, Ability to
Interpret Reading Materials in the Social Studies (80 items, multiple
choice); Sequential Tests of Educational Progress, Social Studies
(70 items, multiple choice); Watson-Glaser Critical Thinking Appraisal
(100 objective items).

Teaching Time: 50 minute lessons over 8 weeks.

Analysis & Results:

1. ANCOVA of ITED showed no treatment differences (F = 3.09,
p > .05).

2. ANCOVA of STEP showed no treatment differences (F = 1.81,
p > .05).

3. ANCOVA of Watson-Glaser showed no training group > trained group
(F and p values illegible).

NOTE: parallel pretest served as the covariate in each case. 

Comments:

1. Analysis of the observational data from a single lesson at the end
of 8 weeks of training is not representative of teaching over the
full treatment period.

2. ANCOVA assumptions are not mentioned, especially homogeneity of
regression.

3. Students as units of analysis are not independent sampling units.

4. Standardized measures of achievement are designed for high
stability coefficients and, thus, are not likely to reveal treatment
differences.

5. Reanalysis of observational data using the mean number of types of
questions (vs. author's use of total number of types of questions)
shows no relation between treatment and teacher acts in the single
lesson observed.

6. There probably are some differences on the ITED measure, p = .079
(reported by author), assuming his analyses are acceptable.

Conclusions:

1. The treatment delivered to students is unknown.

2. Analyses are inaccurate.

3. There is a poor choice for the dependent measures.

4. The reported results are not valid.






Lynch et al.: Study A

Grade: 1-3 (mixed grade lesson groups)

Subject: Science, "why birds sing"

Independent Variable: Interns were told to teach for factual recall vs.
concept mastery in a 1 hour orientation meeting.

Dependent Measure: 12 item recall test (KR20 = .38); 10 item concept
mastery test (KR20 = .52).

Teaching Time: One 3 5-30 minute lesson.

Analysis & Results: MANOVA showed significant differences for teacher
objectives (F = 4.20, p < .025) with recall group > concept
mastery group on recall test, and significant differences for grade
(F = 9.12, p < .001) with grades 3 and 2 > grade 1 on concept
mastery test.

Comments:

1. Analysis of the observational data showed that the concept mastery
group of interns asked significantly fewer knowledge questions and
significantly more higher cognitive questions, but no interrater
agreement coefficient was given.

2. Cell sizes for MANOVA are in the ratio 1 : 1 : 1.25 : 1.50 : 2 : [illegible],
with largest cell variances in the ratio 1 : 7.3 (recall test), 1 : 14
(concept mastery test).

3. Dependent measures have very low reliabilities.

4. Teaching delivered to students is ill-defined and very short.

Conclusions: 

1. The treatment delivered to students is unknown. 

2. Analyses are inaccurate. 

3. The dependent measure is not trustworthy.

4. Teaching time is insufficient for meaningful results.

5. Reported results are not valid. 



Lynch et al.: Study B

Grade: 4-6 (mixed grade lesson groups)

Subject: Symbol recognition for artificial code.

Independent Variable: Interns were told to teach for factual recall vs.
concept mastery in a 1 hour orientation meeting.

Dependent Measure: 24 item knowledge-recall test (KR20 = .79); 31 item
concept mastery test (KR20 = [illegible]).

Teaching Time: One 30-40 minute lesson.

Analysis & Results: MANOVA showed significant differences for teacher
objectives (F = 9.46, p < .001) with recall group > concept
mastery group on knowledge-recall test (F = 17.32, p < .001),
and significant differences for grade (F = 2.76, p < .05) with
grade 6 > 5 > 4 on knowledge-recall test (F = 6.19, p < .01).

Comments & Conclusions:

1. Since Study B was a methodological twin of Study A, the same
comments and conclusions apply.



Millett (1967)
Grade: 8-12

Subject: Social studies, "the McCarthy hearing"

Independent Variable: Interns were trained by oral discussion vs. video
model vs. oral discussion plus video model vs. no training.

Dependent Measure: 12 item short essay test (split half coefficient = .82).

Teaching Time: One 9-23 minute lesson.

Analysis & Results: ANOVA of posttest showed no differences
(F(3,30) = 0.83, p > .05).

Comments:

1. Analyses of observational data are slightly misleading since some
teachers who did not give the achievement test were included in
observational data.

2. Testing time ranged from 7 minutes to 20 minutes, with a mean of
about 12 minutes. Thus, the test is relatively speeded and
gives unequal opportunity for students to show what they learned.

3. Variation in teaching time makes comparisons across groups
difficult since treatments probably varied as a function of time
for the lesson.

Conclusions:

1. Treatment differences between groups are relatively unknown.

2. Variable teaching time plus variable testing time make comparisons
of lesson groups within treatments and between-group treatment
comparisons untrustworthy.

3. Reported results are not valid.






Rogers & Davis (1971)
Grade: 5

Subject: Social studies, The West Indies

Independent Variable: Intern training on asking higher cognitive
questions vs. no training.

Dependent Measure: 35 item multiple choice test (unspecified reliability
coefficient = .75); 5 items for each of seven categories of questions
from Sanders, also considered as separate subscales.

Teaching Time: Four 35-40 minute lessons.

Analysis & Results: ANOVA showed no significant differences for total
test (F = 2.71), memory subtest (F = 2.01), translation
subtest (F = .00), interpretation subtest (F = [illegible]),
application subtest (F = .32), synthesis subtest (F = [illegible]),
and evaluation subtest (F = .00); significant treatment differences
for analysis subtest (F = 14.77) with untrained group >
trained group.

Comments:

1. Seven separate analyses of observational data performed for each
category of questions are not independent since (a) the same
sample of teachers is used for each analysis and (b) the data
were proportions so that the sum of seven proportions must
total 100%.

2. Measuring teachers' use of questions by proportions may be
misleading; the largest absolute difference for a type of
question may be less than 2 questions per lesson if teachers
asked 20 questions per lesson.

3. The analyses incorrectly use students as the unit of analysis
since they are not independent units in this design.

4. The reliability of 5 item subtests is very low (roughly .12 by
the Spearman-Brown formula, but it is likely this figure is
slightly misleading).

Conclusions: 

1. Analyses are inaccurate. 

2. Treatment variation within experimental groups makes comparison
across groups difficult.

3. Subtest analyses are not trustworthy.

4. Reported results are not valid.






Aagaard (1973)
Grade: 11

Subject: Chemistry, radioactivity and radiation

Independent Variable: Scripted lessons with 250 higher cognitive questions
vs. 310 knowledge questions vs. no teacher initiated questions.

Dependent Measure: [illegible] item multiple choice test (KR20 = [illegible]).

Teaching Time: Ten 60 minute lessons over two weeks.

Analysis & Results:

1. ANOVA of pretest (same measure as posttest) showed no differences.

2. ANOVA of posttest showed a significant treatment effect
(F(2,734) = 8.30, p < .01); Scheffé contrasts showed knowledge
questions < higher cognitive questions.

3. ANOVA of gain score showed a significant effect (F = 6.98,
p < .01); Scheffé contrasts showed knowledge questions < higher
cognitive questions (p < .05), mean of no questions plus knowledge
questions < higher cognitive questions.

4. ANOVA of unspecified IQ measure showed significant differences
(F(2,734) = [illegible]).

5. Multiple regression analysis of gain score showed a significant
increase in R² for the model with groups plus IQ vs. groups only
(increase = .031, F = [illegible], p < .01).

6. A priori contrasts of gain score residualized on IQ showed no
questions < higher cognitive questions (F = [illegible], p < .01),
knowledge questions < higher cognitive questions (F = [illegible],
p < .01), and mean of no questions plus knowledge questions <
higher cognitive questions (F = 31.10, p < .01).

Comments:

1. Scheffé contrasts from the ANOVA of pretest showed no questions >
knowledge questions (p < .01), no questions > higher cognitive
questions (p < .01).

2. Scheffé contrasts from the ANOVA of IQ showed no questions >
knowledge questions (p < .03).

3. Differences favoring the no questions group for both pretest and
IQ measures suggest that a finding of no differences on posttest
would show that the treatment had an effect.

4. Observational data is inadequate; only 8 of 30 lessons or 27%
of teaching time was observed. These observations were sampled
unsystematically.

5. Students is not the correct choice for the unit of analysis
since neither students nor classrooms were randomly assigned
to treatments.

6. Analyses using gain scores are unreliable.

7. No correlation between IQ and posttest is given.

8. A mean gain of only 7 items over 10 lessons suggests that
teaching, the curriculum, or some other factor inhibited the
effectiveness of the lessons.

Conclusions:

1. Analyses are inaccurate.

2. The treatment delivered to students is relatively unknown.

3. Reported results are not valid.



Buggey (1971)
Grade: 2 

Subject: Social studies; one unit on rules and a second unit on locations. 

Independent Variable: 70% higher cognitive questions vs. 30% higher
cognitive questions vs. no instruction on curriculum; sex; urban vs.
suburban school location.

Dependent Measure: Sum of scores from two 30 item multiple choice tests
on each of the two units (KR20 = [illegible] for summed scores).

Teaching Time: 8 minute lessons over 3 weeks on unit 1, and 8
minute lessons over 3 weeks on unit 2.

Analysis & Results: ANOVA of posttest showed significant differences for
teaching method (F = 269.99, p < .01) and school location
(F = 10.89, p < .01); Newman-Keuls contrasts showed 70% higher
cognitive questions > 30% higher cognitive questions and no
instruction (p < .01 for both), 30% higher cognitive questions >
no instruction (p < .01).

Comments:

1. Since KR20 is a measure of internal consistency, it is curious
that a 60 item test composed of two supposedly different subscales
measuring different content has such a large coefficient.

2. There was no observation of teaching.

3. Since the first unit was taught by one teacher and the second unit
by a different teacher, results may reflect a topic by teacher
interaction or warm-up effect.

4. Using a no instruction group as control wastes resources.

5. The threat of reactive testing effects by administering a pretest
seems minor.






Conclusions:

1. Analyses may confound treatment differences with a teacher by
topic interaction.

2. The treatment delivered to students was not observed, but since
the experimenter was one teacher (the other two were also Ph.D.
candidates doing a replication or extension of this design), it
seems relatively safe to assume the treatment was known. It was
neither analyzed nor reported, however.

3. Results probably are valid as reported.



Church (1970)

Grade: Standard 4 (approximately Grade 10-11) 
Subject: Science, electricity 

Independent Variable: 

1. Study A: 171 primary questions vs. 53 primary questions.

2. Study B: 65% open primary questions for 110 minute (long) lessons
vs. 65% open primary questions for 66 minute (short) lessons vs.
35% open primary questions for 70 minute (short) lessons.

3. Study C: Teacher response to secondary questions: prompts vs.
extensions vs. teacher gives answer.

4. Study D1: Number of questions, Q and Q/2 (actual number not specified).

5. Study D2: Number of questions, 171 primary questions vs. "reduced as
far as possible" (actual number not specified).

Dependent Measure: Achievement test corrected score (correction measure and 
method unspecified) . 

Teaching Time: 

1. Studies A, D1, D2: 3 lessons of varying length in minutes.

2. Studies B, C: 4 lessons of varying length in minutes.

Analysis & Results: 

1. Study A: 171 primary questions (X = 36.3) > 53 primary questions
(X = 31.9).

2. Study B: 65% open questions long lesson (X = 31.4) and 35% open
questions short lesson (X = 31.2) > 65% open questions short lesson
(X = 27.9).

3. Study C: prompting (X = 31.4) > extension (X = 29.0) > teacher
gives answer (X = 27.9).

4. Study D1: Q questions (X = 39.8) > Q/2 questions (X = 38.0).

5. Study D2: 171 primary questions (X = 36.3) > "reduced as far as
possible" (X = 33.4).






Comments:

1. The measure and method used for correcting achievement test
scores are not specified.

2. There is no presentation of basic statistical information, e.g.,
unit of analysis, standard deviations, sample size, inferential
tests of hypotheses.

3. The study used a "middle group" of students from classrooms as
its sample; this restricts the range of individual differences
and limits generalizability due to unrepresentativeness of the
sample.

4. For every corrected mean score in all the studies the greater
(greatest) mean is associated with lessons taking longer time;
the differences in average time for lessons range from 1 minute
to 59 minutes.



Conclusions: 



1. The reported results contribute little to knowledge about the
effects of teacher questions on student achievement.






Martikean (1973)
Grade: 3-4

Subject: Science, plants and seeds

Independent Variable: 107 higher cognitive questions plus 9 knowledge
questions vs. 5 higher cognitive questions plus 52 knowledge
questions.

Dependent Measure: 11 item objective achievement test.

Teaching Time: Not specified, presumably 1 lesson.

Analysis & Results:

1. t-test on parallel pretest means showed no difference
(t(29) = .07, p > .05).

2. t-test on posttest means showed no difference
(t(29) = .21, p > .05).

3. t-test on mean gain scores showed no difference
(t(29) = [illegible], p > .05).

Comments:

1. The absence of basic statistical information, e.g., standard
deviations, reliability coefficients, limits interpretation.

2. No observation of teaching.

3. The t-test on gain scores is unreliable; the analysis should
have been a t-test for correlated samples.

4. The ratio of questions is approximately 2:1. This suggests that
time varied considerably.



Conclusions: 

1. The treatment delivered to students is unknown.

2. The results are relatively uninterpretable due to poor reporting.



Ryan (1973)
Grade: 5

Subject: Social studies, geography

Independent Variable: 75% higher cognitive questions vs. 5% higher
cognitive questions vs. no instruction on curriculum.

Dependent Measure: 58 item knowledge-recall multiple choice test
(KR20 = [illegible]); 46 item higher cognitive question multiple choice test.

Teaching Time: 9 minute lessons over 2 weeks. 

Analyses & Results: 

1. ANOVA of knowledge questions posttest showed significant treatment
effect (F(2,103) = 21.37, p < .01); Newman-Keuls contrasts showed 75%
higher cognitive questions and 5% higher cognitive questions > no
instruction (p < .01 for both).

2. ANOVA of higher cognitive questions posttest showed significant
treatment effect (F(2,103) = [illegible], p < .01); Newman-Keuls contrasts
showed 75% higher cognitive questions > no instruction (p < .01).

3. ANOVA of knowledge questions retention test showed significant
treatment effect (F = 16.15, p < .01); Newman-Keuls contrasts
showed 75% higher cognitive questions and 5% higher cognitive
questions > no instruction (p < .01 for both).

4. ANOVA of higher cognitive questions retention test showed
significant treatment effect (F = 5.64, p < .01); Newman-Keuls
contrasts showed 75% higher cognitive questions > no
instruction (p < .01).

Comments: 

1. The author states teachers were observed occasionally, but presents
no data on their adherence to treatment.

2. Each treatment was delivered by only one teacher; treatment effects
are confounded with teachers and can be attributed to treatments,
teachers, or a treatment by teacher interaction.

3. The author misinterprets the data; unreliable differences
(p > .10) are claimed to show consistent effects.

4. There is insufficient information about how long students had
to respond to a total of 104 multiple choice items. This
raises the question of test speededness. If the tests were
speeded, KR20 would be spuriously high and the analyses lack
power.

5. A no instruction group is a waste of resources.

Conclusions:

1. There is confounding of treatment with teacher.

2. The dependent measure may be unreliable.

3. The results are not interpretable regarding the effects of
questions. The only differences are between students who studied
the curriculum they were tested on and those who didn't study
this curriculum.



Ryan (1974) 
Grade: 5

Subject: Social studies, geography

Independent Variable: 75% higher cognitive questions vs. 5% higher
cognitive questions vs. no instruction on curriculum.

Dependent Measure: 58 item knowledge-recall multiple choice test
(KR20 = .89), 46 item higher cognitive question multiple choice
test (KR20 = [illegible]).

Teaching Time: 9 minute lessons over two weeks.

Analyses & Results: 

1. ANOVA of knowledge questions posttest showed significant
treatment effect (F(2,104) = 36.03, p < .01); Newman-Keuls
contrasts showed 75% higher cognitive questions and 5% higher
cognitive questions > no instruction (p < .01 for both).

2. ANOVA of higher cognitive questions posttest showed significant
treatment effect (F(2,104) = 5.24, p < .01); Newman-Keuls
contrasts showed 75% higher cognitive questions > no instruction
(p < .01), and 5% higher cognitive questions > no instruction
(p < .05).






3. ANOVA of knowledge questions retention test showed significant
treatment effect (F(2,104) = 20.87, p < .01); Newman-Keuls
contrasts showed 75% higher cognitive questions and 5% higher
cognitive questions > no instruction (p < .01 for both).

4. ANOVA of higher cognitive questions retention test showed
significant treatment effects (F(2,104) = [illegible], p < .01);
Newman-Keuls contrasts showed 75% higher cognitive questions and 5%
higher cognitive questions > no instruction (p < .01 for both).

Comments : 

1. Since this study is a methodological twin of Ryan's 1973 study,
the same comments apply.

2. The df in the report are consistently 1 less than they should be
for MSe.

3. The author states incorrectly that all three groups were run
concurrently; the control group for 1974 was the same as that for
1973 (confirmed by personal communication, F. Ryan, February 20,
1975).

Conclusions:

1. The results do not contribute to knowledge about the effects of
questions.



Savage (1972) 
Grade: 5 

Subject: Social studies, one unit on rules and a second unit on locations.

Independent Variable: 70% higher cognitive questions vs. 30% higher 

cognitive questions vs. no instruction on curriculum; sex; urban vs. 
suburban school location. 

Dependent Measure: Sum of scores from two 30 item multiple choice tests 
(KR20 = .84 for summed scores).

Teaching Time: 8 minute lessons over 3 weeks on unit 1; 8 minute 

lessons over 3 weeks on unit 2. 

Analysis & Results: ANOVA of posttest showed significant differences for
teaching method (F = 80.84, p < .01) and sex (F = 24.95,
p < .01) with females > males. Newman-Keuls contrasts showed 70%
higher cognitive questions and 30% higher cognitive questions > no
instruction (p < .01 for both).






Comments & Conclusions: 

1. Since this study is a methodological twin of Buggey's (1971),
the same comments and conclusions apply. The reported results
are probably valid.



Tyler (1971)
Grade: 2

Subject: Social studies, one unit on rules and a second unit on locations.

Independent Variable: Teacher asked questions vs. students read questions
(70% higher cognitive and 30% knowledge questions were identical for
both) vs. no instruction on curriculum; sex; urban vs. suburban
school location.

Dependent Measure: Sum of scores from two 30 item multiple choice tests
(KR20 = .84 for summed scores).

Teaching Time: 8 minute lessons over 3 weeks for unit 1; 8 minute 

lessons over 3 weeks for unit 2. 

Analysis & Results: ANOVA of posttest showed significant differences for
teaching method (F(2,108) = 121.95, p < .01); school location
(F(1,108) = 97.40, p < .01), with suburban > urban; treatment x school
location (F(2,108) = [illegible], p < .05); and sex x school location
(F(2,108) = 7.23, p < .01); Newman-Keuls contrasts for teaching method
showed teacher asked questions and student read questions > no
instruction (p < .01 for both), teacher asked questions > student
read questions (p < .05).

Comments & Conclusions:

1. Since this study is a methodological twin of Buggey's (1971), the 
same comments and conclusions apply. The reported results are 
probably valid. 






Discussion 

The four training studies and eight skills studies examined in this
review highlight several methodological flaws that are probably common
to much research on teaching. The correction of these flaws and more
astute consideration of the limitations of method should become prominent
in future investigations. These issues are briefly summarized in the
following.

Reporting the study. Several of the studies suffer from inadequate
reporting. Training of teachers should be described briefly so that
meaningful comparisons can be made of studies which examine the same or
similar variables but use different training methods to get teachers to
use the teaching actions under investigation. Reference should be made to
documents used in training, the length of training, and standardized
training exercises such as microteaching. Of particular importance is the
need to explicitly and exhaustively describe the teaching behaviors trained
as well as those untrained. I recommend that separate reports fully
describing the training be cited and made available. This will have the
triple benefit of saving space in journal articles, of fully describing the
independent variable, and of contributing to knowledge about the
effectiveness of various training techniques for particular teaching acts.

A second limitation of reporting obvious in several studies is the
insufficient presentation of descriptive statistics, including standard
deviations, reliability coefficients, and correlations between measures
used as covariates or residualizing variables and posttest scores. These
statistics contribute much to permitting a reader to form his own
interpretation of the results.

Design. This review of experimental studies in a limited area of
teacher effectiveness corroborates the finding of Heath and Nielson (1974)
that many research efforts are methodologically inadequate. All research 
on teacher effectiveness should include observation of the teaching that 
takes place. Without this component, the actual treatment of an experiment 
is unknown to a degree that casts serious doubt on the validity of infer- 
ences drawn from the data. Furthermore, a simple statement that teachers 
adhered to the definition of the treatment is insufficient. The degree of 
variation from the treatment as well as the characteristics of the varia- 
tion(s) may be critical in judging what produced the observed results. 
Therefore, research studies should include a description of variations in
the treatment and, where space permits, a formal analysis of the degree of 
variation should be presented. In the absence of available space, a 
citation to a document containing such analyses should be available to 
supplement a summary presentation of the analyses in the published paper. 

Studies should be designed to allow an estimate of the variance in
the dependent measure attributable to variation between the teachers.
This seems accomplished most easily by making teachers a factor in the
experimental design that is fully crossed with treatments. Failure to
include this factor will usually leave treatment effects fully or partially
confounded with teacher effects, thus confusing the interpretation of why
the results turned out as they did.
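
As an illustration of what "fully crossed" means here, the following is a minimal Python sketch, not part of any study reviewed. The function name crossing_report and the teacher labels T1 and T2 are hypothetical; the treatment labels are borrowed from the skills experiments above. It simply checks whether every teacher appears under every treatment and lists the teacher-by-treatment cells that are missing.

```python
from itertools import product


def crossing_report(cells):
    """Summarize how teachers and treatments are combined in a design.

    `cells` is an iterable of (teacher, treatment) pairs, one pair per
    lesson group actually taught. The names are illustrative only.
    """
    cells = set(cells)
    teachers = sorted({teacher for teacher, _ in cells})
    treatments = sorted({treatment for _, treatment in cells})
    # A fully crossed design contains every teacher-by-treatment cell.
    missing = sorted(set(product(teachers, treatments)) - cells)
    return {"fully_crossed": not missing, "missing_cells": missing}


treatments = ["75% higher cognitive", "5% higher cognitive"]

# One teacher per treatment: teacher and treatment cannot be separated.
confounded = [("T1", treatments[0]), ("T2", treatments[1])]

# Every teacher delivers every treatment: teacher variance can be estimated.
crossed = list(product(["T1", "T2"], treatments))

print(crossing_report(confounded))
print(crossing_report(crossed))
```

In the first layout any difference between groups could be due to the treatment, the teacher, or both; the second layout is the one recommended in the paragraph above.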






The arguments over whether to use clasyrooms (or teachers) or indi- 
vidual students as the unit of analysis are complex. The simplest reso- 
lution of the choice is the following. In analyses like multiple regres- 
sion, analysis of variance, analysis of covariance, and the like, there 
must not l)..^ a systcj i.:t j c relation heLwoen the units of analysis and any 
factor iu the design to avoid the problems of correlated errors. Assign- 
ment of cin Intact classroom to a treatment and then using students as the 
units of analysis probably violates this dictum for almost any exneririen- 
tal factor. At the least, there probably exist patterns of social inter- 
ation within an intact classroom that influence student response tenden- 
cies and attentional factors. Since the ideal condition of randomly 
assigning all students to a treatment is seldom feasible, researchers 
must find a middle ground best suited to the questions posed in their 
studies. A powerful control for classroom is to randomly divide each 
classroom into equal halves, thirds, or quarters. These randomly formed 
groups then should be randomly assigned to treatments that differ only on 
one dimension of the experimental design. For example, a study examining 
the relative effects of knowledge and higher order questions should randomly 
halve classrooms and randomly assign each half to a level, knowledge 
questions or higher order questions, of the experimental factor of type of 
questions. Any other independent variables should be held constant for 
the two groups formed from the same classroom. This assignment procedure 
can validly use students as the unit of analysis under the condition that 
teachers who taught each half were statistically and reasonably equivalent 
in their rendition of the treatment. This is because the half-classrooms 
differ randomly with respect to the independent variable which has been 
varied. Since all other independent variables are constant over the two 
halves, and since there has been random assignment to the experimental factor 
varied, the likelihood of correlated errors is reduced considerably. 
The generalization of this reasoning to an independent variable with more 
than two levels is straightforward. 
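
The assignment procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_classrooms, the roster layout (a dictionary mapping a classroom label to a list of student identifiers), and the treatment labels are hypothetical and do not come from any of the studies reviewed.

```python
import random


def split_classrooms(classrooms, treatments, seed=None):
    """Randomly divide each classroom into len(treatments) near-equal groups
    and randomly assign one group to each treatment level.

    classrooms: dict mapping a classroom label to a list of student ids
                (hypothetical layout, chosen for illustration).
    treatments: list of treatment-level labels, e.g. two question types.
    Returns a dict mapping (classroom, student) to a treatment label.
    """
    rng = random.Random(seed)
    assignment = {}
    for room, students in classrooms.items():
        shuffled = list(students)
        rng.shuffle(shuffled)      # random order of students within the room
        levels = list(treatments)
        rng.shuffle(levels)        # random order of treatment levels
        for i, student in enumerate(shuffled):
            # Dealing students out in turn yields groups whose sizes
            # differ by at most one within each classroom.
            assignment[(room, student)] = levels[i % len(levels)]
    return assignment


# Example: two classrooms, each randomly halved between two question types.
rosters = {"room_1": list(range(30)), "room_2": list(range(28))}
plan = split_classrooms(rosters, ["knowledge", "higher_order"], seed=1)
```

Because every classroom contributes a randomly formed group to every treatment level, classroom membership is not systematically related to the experimental factor. The sketch does nothing to guarantee that teachers render the treatment equivalently in each group; that condition still has to be checked separately.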

In addition to the caveat that teachers in each half classroom be 
equivalent, two other cautions must be heeded. Aggregating information 
over classrooms requires that classrooms be randomly assigned to treatment 
conditions under the restriction outlined above. It also requires that 
where more than one classroom is assigned to the same treatment condition, 
great care be taken that the classrooms are not systematically different 
insofar as possible. This demands rigorous investigation of the sample 
characteristics at the level of classrooms. For example, classrooms should 
be compared on a measure of general ability like vocabulary, an interest or 
attitude survey relating to the teaching variables to be studied and the 
experimental curriculum, or other measures relevant to the particular 
investigation. 

The second caution in this method of attacking the problem of units of 
analysis is one pertaining to sample size. Dividing a classroom of 30 
students into a large number of groups yields only a small number of stu- 
dents per group. This can destroy the statistical power gained by the 
sampling procedure outlined above. There seem to be only two alternatives 
to the dilemma of wanting to ask many research questions with limited resources: 
ask fewer questions or sacrifice the probability of identifying 
true differences due to decreases in power resulting from small sample sizes. 
The latter option can be disastrous for building knowledge about teachers' 
ability to influence student learning since we play a dart game with a 
small board and unfeathered darts to begin with. The former option is to 
be valued. It requires that each study ask a few piercing questions and 
that sets of studies be programmatic so that the range of information desired 
can be obtained over several investigations. 

Analysis. The studies reviewed here reveal both errors of omission 
and commission regarding the appropriate use of statistical analyses. Perhaps 
most obvious in almost every study is the absence of an explicit 
statement that critical assumptions underlying statistical analyses were 
tested and judged valid. Moreover, it was shown that several important 
assumptions were not justified in some studies. A preliminary test for the 
validity of assumptions is essential for good research which relies on statistical 
analyses for inferences of the causal effects of an independent 
variable. Every study ought to state that assumptions were examined and 
how they were examined. The presence of two to six such sentences would 
greatly improve the interpretability of research on teaching. 

I also recommend that investigators report a measure of the proportion 
of variance in the dependent measure that can be accounted for by each or 
several of the independent variables. This measure need not be ω² (omega 
squared; see Hays, 1973), but this statistic or a suitable equivalent can be very 
informative about how influential the treatment is in determining values 
of the dependent variable. A treatment that exhibits a low value of ω² or 
an equivalent statistic is not a major determinant of scores on the dependent 
variable. It might be said that such a treatment is not pure or that 
the effects observed depend on other unidentified factors or their interactions 
with the independent variable manipulated in the research. Thus, 
such treatments need dissection or further consideration in future research 
before they can be accepted as the causal agent promoting the differences 
observed between treatment groups that are statistically different at a 
given level of significance. 
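
As an illustration of the kind of statistic recommended here, the following Python sketch computes omega squared for a one-way, fixed-effects analysis of variance using the usual estimate (SS between minus (k - 1) times MS within, divided by SS total plus MS within). The function name and the sample scores are hypothetical, and the sketch assumes a single treatment factor with independent groups; it is not taken from any study reviewed.

```python
def omega_squared(groups):
    """Estimate omega squared for a one-way fixed-effects ANOVA.

    groups: list of lists, one list of dependent-measure scores per
            treatment group (hypothetical data layout).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Partition the total sum of squares into between and within parts.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n_total - k)

    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)


# Hypothetical posttest scores for two treatment groups.
estimate = omega_squared([[12, 15, 14, 10, 13], [16, 18, 15, 17, 19]])
```

The estimate can fall slightly below zero when group means differ by less than chance alone would produce; such values are conventionally read as zero. A statistically significant F accompanied by a small estimate signals the situation described above, in which the treatment is reliable but not a major determinant of scores.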

Finally, where continuous variables such as prior achievement, general 
intelligence, and the like are used as "controlling" variables, these variables 
should enter the analyses as they are, not as blocking variables. 
Analyses which use a median-split or tripartite blocking on continuous data 
lose considerable statistical power (see Cronbach & Snow, in press). 
It is a much better arrangement to use a general linear model or generalized 
regression analysis in which the continuous variables are forced to be the 
first variable used to partition variance in the dependent measure. These 
procedures are further described by Walberg (1971) and Cohen (1968). 
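
A minimal sketch of this ordering, assuming one continuous covariate and one two-level treatment coded 0 and 1, is given below. The covariate is entered first, and the treatment is credited only with the additional proportion of variance it explains. The function names and the data are hypothetical, and the second step uses the standard device of regressing residuals on residuals rather than a full regression package.

```python
def _mean(values):
    return sum(values) / len(values)


def _residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    mx, my = _mean(x), _mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def hierarchical_r2(y, covariate, treatment):
    """Return (R2 for the covariate alone, increment in R2 for treatment).

    The covariate is forced in first, in its continuous form; the
    treatment is then evaluated on what the covariate left unexplained.
    """
    my = _mean(y)
    ss_total = sum((b - my) ** 2 for b in y)

    # Step 1: variance in the dependent measure explained by the covariate.
    y_res = _residuals(y, covariate)
    r2_covariate = 1 - sum(e ** 2 for e in y_res) / ss_total

    # Step 2: regress residualized scores on the residualized treatment.
    t_res = _residuals(treatment, covariate)
    cross = sum(a * b for a, b in zip(t_res, y_res))
    ss_treatment = cross ** 2 / sum(e ** 2 for e in t_res)

    return r2_covariate, ss_treatment / ss_total


# Hypothetical data: prior achievement, treatment code, posttest score.
prior = [42, 55, 47, 61, 50, 58, 45, 63]
group = [0, 1, 0, 1, 0, 1, 0, 1]
posttest = [20, 29, 23, 31, 24, 30, 21, 33]
r2_prior, r2_added = hierarchical_r2(posttest, prior, group)
```

Replacing the prior achievement scores with a high/low median split before this analysis would discard the within-block differences among students, which is the loss of power the paragraph above warns against.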

Dependent Measures. Several points about necessary characteristics of 
measures of student achievement also have been shown to have great influence 
in judging the quality of research on teaching. The reliability of an 
achievement test is a key factor in the faith which can be placed in differences 
observed between treatments, since a low reliability indicates that 
students' scores reflect chance variation in test responses more than 
variation attributable to true ability. This statement, of 
course, rests on the tenets of classical measurement theory (e.g., see 
Gulliksen, 1950). 






The reliability of a dependent measure also has an influence on the 
power of statistical analyses. A low reliability, i.e., large error score 
components, will inflate the error term in an analysis, thus decreasing the 
possibility of claiming differences when they may really be present. 
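
The size of this penalty can be illustrated with classical measurement theory. If reliability is the ratio of true-score variance to observed-score variance, and measurement error is unrelated to treatment, then a standardized mean difference expressed in observed-score units shrinks by the square root of the reliability. The Python sketch below works through that arithmetic; the function names are hypothetical, and the sample size rule is an approximation for a two-group comparison of means, not a full power analysis.

```python
import math


def observed_effect_size(true_effect_size, reliability):
    """Standardized mean difference expected on the observed scores.

    true_effect_size: mean difference in true-score standard deviation units.
    reliability: ratio of true-score to observed-score variance, in (0, 1].
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    # Observed SD equals true SD divided by sqrt(reliability), so the
    # same raw mean difference is a smaller fraction of the observed SD.
    return true_effect_size * math.sqrt(reliability)


def sample_size_inflation(reliability):
    """Approximate factor by which group sizes must grow to hold power.

    The sample size needed to detect a mean difference is roughly
    proportional to 1 / d**2, and d**2 shrinks in proportion to
    reliability, so the factor is about 1 / reliability.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    return 1.0 / reliability


# A test with reliability .64 shrinks a true effect of .50 SD to about .40 SD.
shrunken = observed_effect_size(0.50, 0.64)
```

Under these assumptions, a study using a test with reliability near .50 would need roughly twice as many students per group as one using a perfectly reliable test to retain the same power.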

As discussed previously, all students should have an equal opportunity 
to respond to all items of a test measuring learning. Furthermore, unless 
speed is a natural element in a particular type of school learning, measures 
of achievement should not be overly speeded. Speeding not only tends to 
spuriously inflate estimates of reliability, but since students do not reach 
some or many test items, it also tends to decrease the content validity of 
the sample of items chosen to reflect the domain of instruction. 

General Questions of Method. Two points unaddressed in the foregoing 
merit consideration in any study of research on teaching. Learning some 
part of a curriculum is not a short lived event. The content material and 
processes of thought encouraged by exposure to a curriculum via particular 
methods of teaching are not independent of all that students have acquired 
by previous experience and instruction. Relative to the knowledge and abil- 
ities which students can bring to a learning situation, a short instructional 
period is not likely to be particularly outstanding as an agent for changing 
student achievement scores. That is, of course, unless the measure of 
achievement focuses on little more than rote recall. The duration for 
instruction ought to be at least some small number of lessons, say five to 
ten. This will allow students some time to adjust to the style of teaching 
from which they are to learn. It also will allow the presentation of 
material that requires students to comprehend information in the sense of 
being able to manipulate it relative to a purpose. 

A second general point is that research on teaching should be more 
attuned to the learning characteristics of students being taught. For ex- 
ample, only one study on teachers' questioning strategies (Martikean, 1973) 
has raised the issue of whether the distinction between knowledge questions 
and higher cognitive questions is appropriate at all developmental levels. 
The studies reviewed here show the same type of question used to teach 
second graders as eleventh graders. Changes in memory span, ability to 
organize information, and the acquisition of strategies for reasoning may 
be quite variant for students in the age ranges of eight to sixteen years. 
Yet, research seems to assume blithely that these differences are unworthy 
of mention, much less direct investigation. The question of aptitude-treatment 
interactions is an important one for research on teaching and should not be 
dismissed casually (cf. Cronbach & Snow, in press). 

Conclusion 

This review of experiments which examined the effects of teacher 
questions on student achievement was done under the assumption that the 
label "experimental study" was not sufficient proof for accepting conclusions 
put forth by the investigators. Intense consideration of the methodology 
of the twelve studies in this area showed that nine of them probably could 
not speak validly to the degree of influence that teacher questions have on 
student achievement. Of the three studies (Buggey, 1971; Savage, 1972; 
Tyler, 1971) that were relatively sound methodologically, only two obtained 
differences for students who studied the material on which they were tested. 
Buggey's (1971) study suggests that higher cognitive questions lead to improved 
achievement relative to lower cognitive questions for second graders. 
Tyler's (1971) dissertation implies that questions framed by teachers are 
more effective than questions presented in text for second graders. 
Savage's (1972) failure to replicate Buggey's (1971) results in the fifth 
grade could result from several factors, some of which may be differences 
in students' level of development, the inappropriateness of the same in- 
structional materials for one of the two grade levels, different teachers 
teaching different halves of the two-unit curriculum, and so forth. One 
telling statistic is that the no-study group of second graders had a mean 
posttest score of 15.89 while that for the fifth graders was 30.00. This 
suggests a large difference in prior knowledge, general reasoning ability, 
test wiseness, or some combination of these and other factors. 

The presence of only these three studies does not provide a sturdy 
base for generalizations about the effects of teacher questions on student 
achievement. Perhaps more important, however, this review has shown that 
consumers of educational research need to be alert when reading studies and 
reviews of experimental investigations. Attributing causality is not a 
product easily obtained by conducting an experiment. Considerable care and 
expertise are required in designing, analyzing, interpreting, and reporting 
good research. This paper has offered several suggestions and alternatives 
for bettering these practices, although it is by no means definitive on 
these concerns. It is important that the quality of educational research 
be high so that neither further research nor educational practice is misled 
by superficial claims to strong method. 







References 

Aagaard, S. A. Oral questioning by the teacher: Influence on student 
achievement in eleventh grade chemistry (Doctoral dissertation, 
New York University, 1973). 

Beseda, C. G. Levels of questioning used by student teachers and its 
effect on pupil achievement and critical thinking ability (Doctoral 
dissertation, North Texas State University, 1972). 

Buggey, L. J. A study of the relationship of classroom questions and social 
studies achievement of second-grade children (Doctoral dissertation, 
University of Washington, 1971). 

Church, J. An experimental study of differing teaching techniques in the 
teaching of a science topic at the standard four level. Unpublished 
manuscript, University of Canterbury, New Zealand, 1970. 

Cohen, J. Multiple regression as a general data-analytic system. Psychological 
Bulletin, 1968, 70, 426-443. 

Cronbach, L. J. & Snow, R. E. Aptitudes and instructional methods. New York: 
Irvington Publishers, in press. 

Gall, M. D. The use of questions in teaching. Review of Educational Research, 
1970, 40, 707-721. 

Gulliksen, H. Theory of mental tests. New York: Wiley, 1950. 

Hays, W. L. Statistics for the social sciences. New York: Holt, Rinehart, 
and Winston, 1973. 

Heath, R. W. & Nielsen, M. A. The research basis for performance-based 
teacher education. Review of Educational Research, 1974, 44, 463-484. 

Lynch, W. W., Ames, C., Barger, C., Frazer, W., Hillman, S., & Wisehart, S. 
Effects of teachers' cognitive demand styles on pupil learning (Final 
Report 30.3). Bloomington, IN: Center for Innovation in Teaching the 
Handicapped, Indiana University, February, 1973. 

Martikean, A. The levels of questioning and their effects upon student 
performance above the knowledge level on Bloom's taxonomy of educational 
objectives (Field Research and Development Study - E585). Gary, IN: 
Indiana University Northwest, February, 1973. 

Millett, G. B. Comparison of four teacher training procedures in achieving 
teacher and pupil "translation" behaviors in secondary school social 
studies (Doctoral dissertation, Stanford University, 1967). 






Rogers, V. M. & Davis, O. L. Varying the cognitive levels of classroom 
questions: An analysis of student teachers' questions and pupil 
achievement in elementary social studies. Paper presented at the 
1970 meeting of the American Educational Research Association, Lexington, 

Ryan, F. L. The effects on social studies achievement of multiple student 
responding to different levels of questioning. Journal of Experimental 
Education, 1974, 42, 71-75. 

Ryan, F. L. Differentiated effects of levels of questioning on student 
achievement. Journal of Experimental Education, 1973, 41, 63-67. 

Savage, T. V. A study of the relationship of classroom questions and social 
studies achievement of fifth grade children (Doctoral dissertation, 
University of Washington, 1972). 

Tyler, J. F. A study of the relationship of two methods of question presentation, 
sex, and school location to the social studies achievement of second grade 
children (Doctoral dissertation, University of Washington, 1971). 

Walberg, H. J. Generalized regression models in educational research. American 
Educational Research Journal, 1971, 8, 71-91. 



